Pokémon battles aren’t just for trainers anymore — AI models are now duking it out in the digital wilds of Kanto. But what started as a quirky competition between Google’s Gemini and Anthropic’s Claude has stirred up a surprisingly serious conversation about the validity of AI benchmarks.
A recent viral post on X claimed Gemini had outpaced Claude in the original Pokémon Red and Blue, reaching Lavender Town while Claude remained stuck in Mount Moon. The livestream showcasing Gemini’s gameplay was framed as proof of its dominance.
“Gemini is literally ahead of Claude atm in Pokémon…” the post reads, with a clip of the AI-led journey through the classic Game Boy title.
But Reddit users were quick to cry foul: the Gemini stream had a leg up. The developer behind it had implemented a custom minimap — a tool that made it easier for Gemini to parse game tiles like trees or obstacles, effectively reducing its reliance on screen-based perception and giving it a strategic edge.
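The developer’s harness isn’t public, so here’s only a minimal sketch, in Python, of what such a minimap layer might do: collapse raw tile IDs into a plain-text map of walkable versus blocked squares that a model can read directly, instead of inferring terrain from pixels. Every tile ID, symbol, and function name below is an assumption for illustration, not the streamer’s actual code.

```python
# Hypothetical minimap layer for a Pokémon-playing agent.
# Tile IDs and the walkable/blocked split are made up for this sketch;
# the real harness's internals have not been published.

WALKABLE = {0}        # assumed ID for plain ground
BLOCKED = {1, 2, 3}   # assumed IDs for trees, water, ledges

def build_minimap(tile_grid):
    """Collapse raw tile IDs into a coarse text map of the overworld,
    sparing the model from having to recognize terrain in screenshots."""
    rows = []
    for row in tile_grid:
        rows.append("".join(
            "." if t in WALKABLE       # walkable ground
            else "#" if t in BLOCKED   # impassable tile
            else "?"                   # unrecognized tile
            for t in row
        ))
    return "\n".join(rows)

# Example: a 3x4 patch of map around the player.
patch = [
    [1, 1, 0, 0],
    [0, 0, 0, 2],
    [0, 1, 0, 0],
]
print(build_minimap(patch))
# ##..
# ...#
# .#..
```

Feeding the model a grid like this sidesteps the hardest part of the task, perceiving the world from pixels, which is exactly why critics saw it as an unfair head start.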
It’s Just Pokémon… Or Is It?
While few people seriously consider Pokémon a rigorous AI benchmark, the moment highlights a much larger issue: benchmarks are only as fair as their implementation.
Take Anthropic’s own Claude 3.7 Sonnet, which scored 62.3% on SWE-bench Verified (a test of coding skill) but 70.3% once a “custom scaffold” was layered on top. And Meta? They fine-tuned a version of Llama 4 Maverick to shine on LM Arena, a benchmark on which the unmodified model scores markedly lower.
In short: these tests are fragile. Minor tweaks, even undisclosed ones, can drastically alter the results.
The Benchmarking Arms Race
As model releases speed up, developers are finding more ways to optimize for benchmarks rather than real-world performance. That creates a moving target and blurs what “better” really means. Even something as innocent as a Pokémon showdown reveals how subjective, and at times how easily manipulated, AI evaluation can be.
The takeaway? Whether it's Lavender Town or leaderboard bragging rights, how an AI gets there matters just as much as where it ends up.